Simple cryptography

 

One of the oldest methods of encoding a message is to construct a table of character transformations.  This was typical of military and diplomatic codes used 100 years ago.

 

Message:

PULL THE TROOPS ON YOUR LEFT FLANK BACK TO THE BRIDGE.

 

Table

Plain:  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Cipher: G H J K L F D S A Z X C V B N M P I O U Y T R E W Q

 

Encoded message:

MYCC USL UINNMO NB WNYI CLFU FCGBX HGJX UN USL HIKDL.

 

This method of encryption is not used today because it is too easy to crack.  See, for example, the Sherlock Holmes story “The Adventure of the Dancing Men” by A. Conan Doyle.  In English, some letters, such as E and T, occur very frequently, while others, such as Q and Z, are rare.  You can tabulate how often each character occurs in the ciphertext and use those frequencies to decode the message.  For example, this short message has 5 U’s, 5 N’s, 4 C’s, and 4 L’s.  We can guess that two of these four symbols represent T and E.  Seeing the string “USL” twice, we can guess that it means “THE”, and we are on our way to cracking the code.
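
As a sketch of the counting step, the short C program below tallies the letters of the ciphertext above (the program and its hard-coded message are purely illustrative):

#include <ctype.h>
#include <stdio.h>

/* Count how often each letter appears in the ciphertext. */
int main(void)
{
    const char *ciphertext =
        "MYCC USL UINNMO NB WNYI CLFU FCGBX HGJX UN USL HIKDL.";
    int counts[26] = {0};

    for (const char *p = ciphertext; *p != '\0'; p++)
        if (isalpha((unsigned char)*p))
            counts[toupper((unsigned char)*p) - 'A']++;

    for (int i = 0; i < 26; i++)
        if (counts[i] > 0)
            printf("%c: %d\n", 'A' + i, counts[i]);
    return 0;
}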

 

[Miano Table 6.1]

 


Huffman coding

The idea that some letters occur more often than others can be used for data compression.

Why should every letter get 8 bits?  We can represent the most common characters (such as space, E, T) in 3 or 4 bits and use 12 or more bits for the unusual characters.  This is the idea behind Huffman encoding, invented by D.A. Huffman in 1952.

 

This idea works as long as some letters (or colors) occur much more often than others.  If we are encoding white noise (any character is equally likely) then Huffman encoding will not achieve any compression.

 

In Huffman coding, we construct a usage table for a typical message [or for a specific message].  We then use the frequencies of usage to construct a binary tree from the bottom up.  The binary tree then gives the bit code for each character.

 

Method:

1) Find the two values with the lowest frequencies and remove them from the pool.  Ties can be broken arbitrarily. [see below]

2) Make a new tree node whose branches are the two items from the previous step.

3) Make the frequency of the new node be the sum of the branch frequencies.

4) Add the new node to the pool.

5) Repeat until the pool contains only one node.

 

When the tree is done, assign 0 to all the left branches and 1 to all the right branches.

The code for each symbol is found by tracing the tree.
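
Below is a minimal C sketch of this bottom-up construction (the Node structure, the pool handling, and the helper names MakeNode and PrintCodes are illustrative, not from Miano).  It uses the a–d probabilities from the code-comparison table later in these notes; with these tie-breaks and the 0-left/1-right convention it prints the same codewords as Code 3.

#include <stdio.h>
#include <stdlib.h>

struct Node {
    char symbol;              /* meaningful only for leaves */
    double frequency;
    struct Node *left, *right;
};

static struct Node *MakeNode(char symbol, double frequency,
                             struct Node *left, struct Node *right)
{
    struct Node *n = malloc(sizeof *n);
    n->symbol = symbol;
    n->frequency = frequency;
    n->left = left;
    n->right = right;
    return n;
}

/* Trace the finished tree: 0 for each left branch, 1 for each right branch. */
static void PrintCodes(const struct Node *n, char *prefix, int depth)
{
    if (n->left == NULL) {                     /* a leaf: print its code */
        prefix[depth] = '\0';
        printf("%c : %s\n", n->symbol, prefix);
        return;
    }
    prefix[depth] = '0';
    PrintCodes(n->left, prefix, depth + 1);
    prefix[depth] = '1';
    PrintCodes(n->right, prefix, depth + 1);
}

int main(void)
{
    /* Pool of nodes, one leaf per symbol. */
    struct Node *pool[8];
    int count = 4;
    pool[0] = MakeNode('a', 0.5,   NULL, NULL);
    pool[1] = MakeNode('b', 0.25,  NULL, NULL);
    pool[2] = MakeNode('c', 0.125, NULL, NULL);
    pool[3] = MakeNode('d', 0.125, NULL, NULL);

    /* Steps 1-5: repeatedly join the two lowest-frequency nodes under a new
       node whose frequency is their sum, until only the root remains. */
    while (count > 1) {
        int lo1 = 0, lo2 = 1;
        if (pool[lo2]->frequency < pool[lo1]->frequency) { lo1 = 1; lo2 = 0; }
        for (int i = 2; i < count; i++) {
            if (pool[i]->frequency < pool[lo1]->frequency) { lo2 = lo1; lo1 = i; }
            else if (pool[i]->frequency < pool[lo2]->frequency) { lo2 = i; }
        }
        struct Node *parent = MakeNode('\0',
            pool[lo1]->frequency + pool[lo2]->frequency, pool[lo1], pool[lo2]);
        if (lo1 > lo2) { int t = lo1; lo1 = lo2; lo2 = t; }
        pool[lo2] = pool[count - 1];   /* remove the second child from the pool */
        pool[lo1] = parent;            /* replace the first child with the parent */
        count--;
    }

    char prefix[64];
    PrintCodes(pool[0], prefix, 0);    /* prints a:0, b:10, c:110, d:111 */
    return 0;
}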

Examples

[Miano pp. 63-65; 

typo in middle of page 63: The count for P in the tree should be two;

typos on pages 64, 65: lower right symbol in all trees should be a period.]

 

In the final code (table 6-4), no code is a prefix to any other code.  Thus the code is uniquely decodable.  If we could decode the same binary sequence two different ways, the code would be useless.  A prefix code is one in which no complete codeword is the same as the start of another codeword.  All Huffman codes are prefix codes. Prefix codes are always uniquely decodable.  If you represent a code as a binary tree, prefix codes have the property that the codewords are always leaves and never branch points.

Consider these codes:

Letter   Probability   Code 1   Code 2   Code 3   Code 4
a        0.5           0        0        0        0
b        0.25          0        1        10       01
c        0.125         1        00       110      011
d        0.125         10       11       111      0111

Average length:        1.125    1.25     1.75     1.875

 

Code 1 is useless since 0 could mean either a or b.

Code 2 is also not uniquely decodable since 00 could mean either c or aa.

Code 3 is a prefix code and is uniquely decodable.

Code 4 is not a prefix code but is uniquely decodable.
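
For example, with Code 3 the bit string 1100100 can be split only one way, as 110 0 10 0, giving “caba”; the decoder just reads bits until they match a codeword.  With Code 4, a decoder that has read 01 cannot yet tell whether it has seen a b or the beginning of a c or d; it must look ahead before emitting a symbol.  This is why prefix codes are preferred even among uniquely decodable codes.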

The average length of a code is found by summing, over all symbols, the probability of the symbol times the length of its codeword.
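
For example, the average length of Code 3 is (0.5)(1) + (0.25)(2) + (0.125)(3) + (0.125)(3) = 1.75 bits per letter.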

 

Minimum Variance

Letter   Probability   Code 6   Code 7
c        0.4           1        10
a        0.2           01       00
b        0.2           000      11
d        0.1           0010     010
f        0.1           0011     011

Average length:        2.2      2.2

 

Codes 6 and 7 are different Huffman codes that have the same average length per symbol.  However, Code 6 has codewords from 1 to 4 bits long, while Code 7 has only 2- and 3-bit codewords.  Minimum variance is a desirable property, since it can reduce the buffer size needed on a channel.  To get minimum variance, place combined nodes as high on the list as possible.  In other words, when there are several choices for what to put into a node of the tree, favor single letters.

 

Why does this matter?  Suppose that we have a slow channel that can transmit 22 bits per second.  On average, it will handle 10 letters per second.  If we hit the unlikely event of a run of 33 d's and f's, code 6 will need 132 bits but code 7 only 99.  This run will take 6 seconds to transmit with code 6 and 4.5 seconds with code 7.  If a letter is being put into the channel every 0.1 second, the first code will therefore need a larger buffer than the second.  Eventually we should hit some short codes that let us catch up and empty the buffer, but the code with the smaller variance can get by with a smaller (and cheaper) buffer.

 


JPEG Huffman coding

We have shown that a binary tree can be used to generate a table of Huffman codes. 

A tree is not the most computationally efficient way to generate the codes.

It is better to first make a table of the code lengths for each symbol.

[Miano, pp.66-67]

 

Once we know how many bits we need for each symbol, we can use the following algorithm to generate the codes:

 

void GenerateHuffmanCodes
(   int NumberOfCodes,           /* in; number of symbols to encode */
    unsigned char CodeLengths[], /* in; # bits in each code, sorted shortest first;
                                    index: 0 to NumberOfCodes-1 */
    int Codes[]                  /* out; Huffman code;
                                    index: 0 to NumberOfCodes-1 */
)
{
    int HuffmanCodeCounter = 0;
    unsigned char CodeLengthCounter = 1;
    for (int i = 0; i < NumberOfCodes; )
    {
        if (CodeLengths[i] == CodeLengthCounter)
        {
            Codes[i] = HuffmanCodeCounter++;  /* assign the next code of this length */
            i++;                              /* move on to the next symbol */
        }
        else
        {
            HuffmanCodeCounter <<= 1;         /* go down one level in the tree */
            CodeLengthCounter++;
        }
    }
}

 

[Miano, pp. 68-69]

 

Notice that GenerateHuffmanCodes() doesn’t know or care what symbols go with the codes.  The association of symbols to codes is implied by the order in which the symbols are sorted.


What is the most compact way to store the Huffman codes?

A binary tree structure has lots of overhead.

 

We could store the Huffman table as arrays of bytes:

Symbol       Code Length   Code (binary)
A            1             0
Space        3             100
N            3             101
L            4             1100
M            4             1101
P            4             1110
C            5             11110
. (period)   5             11111

 

We can see that we are wasting space on the codes: stored one per byte, the eight codes take 64 bits.  We could instead replace the third column with the packed bit string 01001011100110111101111011111.

This is 29 bits, or 32 bits if we pad it to fit in four bytes.

In fact, we don’t need to store the third column in our file at all.  When we need it, we can generate it in RAM using GenerateHuffmanCodes().
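
As a check, feeding the code lengths from the table above into GenerateHuffmanCodes() reproduces the third column.  The symbol labels below are just for this example:

#include <stdio.h>

/* Prototype from the listing above. */
void GenerateHuffmanCodes(int NumberOfCodes, unsigned char CodeLengths[], int Codes[]);

int main(void)
{
    /* Symbols sorted by code length (and therefore by code value). */
    const char *symbols[8] = { "A", "space", "N", "L", "M", "P", "C", "." };
    unsigned char lengths[8] = { 1, 3, 3, 4, 4, 4, 5, 5 };
    int codes[8];

    GenerateHuffmanCodes(8, lengths, codes);

    /* Prints 0, 100, 101, 1100, 1101, 1110, 11110, 11111. */
    for (int i = 0; i < 8; i++) {
        printf("%-6s length %d code ", symbols[i], lengths[i]);
        for (int bit = lengths[i] - 1; bit >= 0; bit--)
            putchar((codes[i] >> bit) & 1 ? '1' : '0');
        putchar('\n');
    }
    return 0;
}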

 

The second column has 8 entries, one for each symbol.  If we know that the maximum size of a Huffman code is 5 bits, then we can replace the second column with a list of the number of codes of each length.

Implied index (code length)   Number of codes of this length
1                             1
2                             0
3                             2
4                             3
5                             2
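
One way to collapse the per-symbol lengths into this list is sketched below (the function name CountCodeLengths is made up for these notes):

/* Collapse per-symbol code lengths into a count of codes of each length. */
void CountCodeLengths(int NumberOfCodes,
                      const unsigned char CodeLengths[],
                      int MaxCodeLength,
                      int LengthCounts[])   /* index: 1 to MaxCodeLength */
{
    for (int length = 1; length <= MaxCodeLength; length++)
        LengthCounts[length] = 0;
    for (int i = 0; i < NumberOfCodes; i++)
        LengthCounts[CodeLengths[i]]++;
}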

 

In JPEG, the maximum length of a Huffman code is 16 bits.  JPEG Huffman tables are formatted as:

Field size                           Description
1 byte                               Table class
16 bytes                             Count of Huffman codes of size 1 to 16
Sum of values in previous 16 bytes   1-byte symbols, sorted by Huffman code
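
Below is a rough sketch of reading one table in this layout from a byte buffer.  The struct and function names are illustrative and error checking is omitted; in an actual JPEG file the leading byte packs the table class and the table identifier into its two halves:

struct JpegHuffmanTable {
    unsigned char TableClass;      /* table class / identifier byte           */
    unsigned char Counts[17];      /* Counts[n] = # codes of length n, n=1..16 */
    unsigned char Symbols[256];    /* symbols, sorted by Huffman code          */
    int SymbolCount;
};

const unsigned char *ReadHuffmanTable(const unsigned char *p,
                                      struct JpegHuffmanTable *table)
{
    table->TableClass = *p++;
    table->SymbolCount = 0;
    for (int length = 1; length <= 16; length++) {
        table->Counts[length] = *p++;
        table->SymbolCount += table->Counts[length];
    }
    for (int i = 0; i < table->SymbolCount; i++)
        table->Symbols[i] = *p++;
    return p;                      /* first byte after the table */
}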

 


Limiting the size of Huffman codes

If you want to reduce the maximum size of a Huffman code, you can rearrange tree nodes to make it so.

 

[Miano fig. 6.2]

 

Working with an array of length counts, we can shorten one over-long code at a time.  Find a symbol whose code length is at least 2 bits less than the length being reduced and change its leaf into a branch.  One of its two new children keeps the original symbol; the other receives one of the over-long symbols, whose former sibling rises one level to take the place of their old parent.
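
Here is a sketch of that adjustment applied directly to the array of length counts.  It follows the spirit of the description above (and of the Adjust_BITS procedure in the JPEG standard) rather than Miano’s exact code, and it assumes the symbols can in fact be represented within the new limit:

/* LengthCounts[n] = number of codes of length n. */
void LimitCodeLengths(int LengthCounts[], int LongestCode, int MaxAllowed)
{
    for (int i = LongestCode; i > MaxAllowed; ) {
        if (LengthCounts[i] == 0) {
            i--;                          /* nothing left at this length */
            continue;
        }
        /* Find a symbol whose code is at least 2 bits shorter. */
        int j = i - 2;
        while (LengthCounts[j] == 0)
            j--;
        LengthCounts[i]     -= 2;         /* remove a pair of over-long codes          */
        LengthCounts[i - 1] += 1;         /* one of the pair takes its parent's place  */
        LengthCounts[j]     -= 1;         /* the shorter symbol's leaf becomes a branch */
        LengthCounts[j + 1] += 2;         /* ...holding that symbol and the other one  */
    }
}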